Anomaly Detection: The Local Outlier Factor (LOF) Model

Introductory Remarks

Anomalies are data points that differ from the other observations in some way, typically measured against a model fit to the data. In contrast to ordinary descriptive statistics, here we are interested in finding where these anomalous data points lie rather than excluding them as outliers.

We assume the anomaly detection task is unsupervised, i.e. we don’t have training data with points labeled as anomalous. Each data point passed to an anomaly detection model is given a score indicating how different the point is relative to the rest of the dataset. The calculation of this score varies between models, but a higher score always indicates a point is more anomalous. Often a threshold is chosen to make a final classification of each point as typical or anomalous; this post-processing step is left to the user.

The GraphLab Create (GLC) Anomaly Detection toolkit currently includes three models for two different data contexts:

  • Local Outlier Factor, for detecting outliers in multivariate data that are assumed to be independently and identically distributed,
  • Moving Z-score, for scoring outliers in a univariate, sequential dataset, typically a time series, and
  • Bayesian Changepoints, for identifying changes in the mean or variance of a sequential series.

In this short note, we demonstrate how the GLC Local Outlier Factor model can be used to reveal anomalies in a multivariate data set. We will use the customer data from a recent AirBnB New User Bookings competition on Kaggle. More specifically, we have downloaded a copy of the file train_users_2.csv into our working directory. Each row in this dataset describes one of 213,451 AirBnB users; there is a mix of basic features, such as gender, age, and preferred language, as well as the user's "technology profile", including the browser type, device type, and sign-up method.

Libraries and Necessary Data Transformation

First, we fire up GraphLab Create and the other libraries needed for our study, and load the train_users_2.csv file into an SFrame.


In [1]:
import graphlab as gl
from visualization_helper_functions import *


[INFO] graphlab.cython.cy_server: GraphLab Create v1.10.1 started. Logging: /tmp/graphlab_server_1466532393.log
This non-commercial license of GraphLab Create is assigned to tgrammat@gmail.com and will expire on September 21, 2016. For commercial licensing options, visit https://dato.com/buy/.

In [2]:
customer_data = gl.SFrame.read_csv('./train_users_2.csv')


Finished parsing file /home/theod/Documents/ML_Home/12.DatoPy/01.R.Anomaly_Detection/train_users_2.csv
Parsing completed. Parsed 100 lines in 4.11956 secs.
Finished parsing file /home/theod/Documents/ML_Home/12.DatoPy/01.R.Anomaly_Detection/train_users_2.csv
Parsing completed. Parsed 213451 lines in 2.96087 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,int,str,str,float,str,int,str,str,str,str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

In [3]:
customer_data.head(5)


Out[3]:
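+------------+----------------------+------------------------+--------------------+-----------+------+---------------+-------------+
| id         | date_account_created | timestamp_first_active | date_first_booking | gender    | age  | signup_method | signup_flow |
+------------+----------------------+------------------------+--------------------+-----------+------+---------------+-------------+
| gxn3p5htnn | 2010-06-28           | 20090319043255         |                    | -unknown- | None | facebook      | 0           |
| 820tgsjxq7 | 2011-05-25           | 20090523174809         |                    | MALE      | 38.0 | facebook      | 0           |
| 4ft3gnwmtx | 2010-09-28           | 20090609231247         | 2010-08-02         | FEMALE    | 56.0 | basic         | 3           |
| bjjt8pjhuk | 2011-12-05           | 20091031060129         | 2012-09-08         | FEMALE    | 42.0 | facebook      | 0           |
| 87mebub9p4 | 2010-09-14           | 20091208061105         | 2010-02-18         | -unknown- | 41.0 | basic         | 0           |
+------------+----------------------+------------------------+--------------------+-----------+------+---------------+-------------+
+----------+-------------------+--------------------+-------------------------+------------+-------------------+---------------+
| language | affiliate_channel | affiliate_provider | first_affiliate_tracked | signup_app | first_device_type | first_browser |
+----------+-------------------+--------------------+-------------------------+------------+-------------------+---------------+
| en       | direct            | direct             | untracked               | Web        | Mac Desktop       | Chrome        |
| en       | seo               | google             | untracked               | Web        | Mac Desktop       | Chrome        |
| en       | direct            | direct             | untracked               | Web        | Windows Desktop   | IE            |
| en       | direct            | direct             | untracked               | Web        | Mac Desktop       | Firefox       |
| en       | direct            | direct             | untracked               | Web        | Mac Desktop       | Chrome        |
+----------+-------------------+--------------------+-------------------------+------------+-------------------+---------------+
+---------------------+
| country_destination |
+---------------------+
| NDF                 |
| NDF                 |
| US                  |
| other               |
| US                  |
+---------------------+
[5 rows x 16 columns]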

For this presentation we only need a small subset of the available basic customer features, namely 'gender', 'age' and 'language'.


In [4]:
features = ['gender', 'age', 'language']
customer_data = customer_data[['id']+features]
customer_data


Out[4]:
id gender age language
gxn3p5htnn -unknown- None en
820tgsjxq7 MALE 38.0 en
4ft3gnwmtx FEMALE 56.0 en
bjjt8pjhuk FEMALE 42.0 en
87mebub9p4 -unknown- 41.0 en
osr2jwljor -unknown- None en
lsw9q7uk0j FEMALE 46.0 en
0d01nltbrs FEMALE 47.0 en
a1vcnhxeij FEMALE 50.0 en
6uh8zyj2gn -unknown- 46.0 en
[213451 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

From the quick exploratory data analysis below:


In [5]:
%matplotlib inline
univariate_summary_plot(customer_data, features, nsubplots_inrow=3, subplots_wspace=0.7)


Summary Statistics:

           gender            age language
count      213451  125461.000000   213451
unique          4            NaN       25
top     -unknown-            NaN       en
freq        95688            NaN   206314
mean          NaN      49.668335      NaN
std           NaN     155.666612      NaN
min           NaN       1.000000      NaN
25%           NaN      28.000000      NaN
50%           NaN      34.000000      NaN
75%           NaN      43.000000      NaN
max           NaN    2014.000000      NaN

In [6]:
gl.canvas.set_target('browser')
customer_data[['age']].show()


Canvas is accessible via web browser at the URL: http://localhost:34510/index.html
Opening Canvas in default web browser.

In [7]:
print 'Number of customer records with ages of 2013 or larger: %d' %\
len(customer_data[customer_data['age'] >= 2013])


Number of customer records with ages of 2013 or larger: 749

we notice that there are about 750 records with an 'age' value of '2013' or '2014', which is of course wrong. Most probably the year was recorded accidentally in this field. The remaining 'age' values seem reasonable, with only a few customer entries having ages greater than '100'. In fact, more than 124 thousand customer entries have ages in the [1, 132] interval. We have therefore chosen to treat any value below 150 as an eligible recording of a customer's age, and to re-assign all the remaining ones as missing:


In [8]:
customer_data['age'] = customer_data['age'].apply(lambda age: age if age < 150 else None)
customer_data = customer_data.dropna(columns = features, how='any')
print 'Number of Rows in dataset: %d' % len(customer_data)


Number of Rows in dataset: 124681
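
The cleaning rule used above can be sketched in plain Python. This is only an illustration, not part of the GLC workflow: the helper name clean_age and the toy list raw_ages are hypothetical. The explicit None check is there because a bare `age < 150` comparison on a missing value raises a TypeError under Python 3.

```python
def clean_age(age, upper=150):
    """Return the age if it is below `upper`, otherwise None (missing)."""
    if age is None:
        # Missing values stay missing; comparing None with a number fails in Python 3.
        return None
    return age if age < upper else None

# Toy values (hypothetical): a valid age, a missing one, a year typed as an age,
# a very old but accepted age, and a value exactly at the cut-off.
raw_ages = [38.0, None, 2014.0, 105.0, 150.0]
cleaned = [clean_age(a) for a in raw_ages]
print(cleaned)
```

Rows whose cleaned age is None are the ones that the subsequent dropna call removes.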

Now, the univariate summary statistics of the customer_data set take the form:


In [9]:
univariate_summary_plot(customer_data, features, nsubplots_inrow=3, subplots_wspace=0.7)


Summary Statistics:

        gender            age language
count   124681  124681.000000   124681
unique       4            NaN       25
top     FEMALE            NaN       en
freq     57247            NaN   120173
mean       NaN      37.412629      NaN
std        NaN      13.954917      NaN
min        NaN       1.000000      NaN
25%        NaN      28.000000      NaN
50%        NaN      34.000000      NaN
75%        NaN      43.000000      NaN
max        NaN     132.000000      NaN

and more specifically the remaining customer ages follow the distribution below:


In [10]:
# transform the SFrame into a Pandas DataFrame
customer_data_df = customer_data.to_dataframe()
customer_data_df['gender'] = customer_data_df['gender'].astype(str)
customer_data_df['age'] = customer_data_df['age'].astype(float)
customer_data_df['language'] = customer_data_df['language'].astype(str)

In [12]:
# define seaborn style, palette, color codes
sns.set(style="whitegrid", palette="deep",color_codes=True)
# initialize the matplotlib figure
plt.figure(figsize=(12,7))

# draw distplot
ax1 = sns.distplot(customer_data_df.age, bins=None, hist=True, kde=False, rug=False, color='b')


If we would like to explore the countplot for the language variable in more detail, we can temporarily exclude the English-speaking customers and redraw the graph:


In [13]:
# exclude the english-speaking customers
customer_data_df_nen = customer_data_df[customer_data_df['language']!='en']

# define seaborn style, palette, color codes
sns.set(style="whitegrid", palette="deep",color_codes=False)
# initialize the matplotlib figure
plt.figure(figsize=(7,11))
plt.ylabel('language', {'fontweight': 'bold'})
plt.title('Countplot of Customer Languages\n[English-speaking people excluded]',
          {'fontweight': 'bold'})

# draw countplot
ax2 = sns.countplot(y='language', data=customer_data_df_nen, palette='deep', color='b')


The univariate summary statistics plot for this new customer_data_df_nen set is as follows.


In [14]:
univariate_summary_plot(customer_data_df_nen, features, subplots_wspace=0.7)


Summary Statistics:

        gender          age language
count     4508  4508.000000     4508
unique       4          NaN       24
top     FEMALE          NaN       zh
freq      2152          NaN      912
mean       NaN    33.222050      NaN
std        NaN    13.060345      NaN
min        NaN     5.000000      NaN
25%        NaN    25.000000      NaN
50%        NaN    30.000000      NaN
75%        NaN    38.000000      NaN
max        NaN   110.000000      NaN

The data set of interest, customer_data, has two nominal categorical variables:

  • 'gender': nominal categorical attribute (FEMALE/MALE/-unknown-/OTHER)
  • 'language': nominal categorical attribute of 25 different languages.

which are best encoded before applying any learning algorithm. To do so, we apply the OneHotEncoder transformation as shown below:


In [15]:
one_hot_encoder = gl.toolkits.feature_engineering.OneHotEncoder(features=['gender', 'language'])
customer_data1 = one_hot_encoder.fit_transform(customer_data)
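
To make the idea of a sparse one-hot encoding concrete, here is a minimal pure-Python sketch. The functions fit_one_hot and transform_one_hot are hypothetical helpers written for illustration; the integer indices they assign follow the order of first appearance and will generally differ from the indices GLC assigns internally.

```python
def fit_one_hot(rows, columns):
    """Assign an integer index to every (column, category) pair, in order of first appearance."""
    index = {}
    for row in rows:
        for col in columns:
            key = (col, row[col])
            if key not in index:
                index[key] = len(index)
    return index

def transform_one_hot(row, columns, index):
    """Encode one row as a sparse dictionary {category_index: 1}."""
    return {index[(col, row[col])]: 1 for col in columns}

# Toy rows (hypothetical), using category values that occur in the dataset.
rows = [
    {'gender': 'MALE', 'language': 'en'},
    {'gender': 'FEMALE', 'language': 'en'},
    {'gender': 'MALE', 'language': 'zh'},
]
columns = ['gender', 'language']
index = fit_one_hot(rows, columns)
encoded = [transform_one_hot(r, columns, index) for r in rows]
print(encoded)
```

Each row ends up with exactly two active indices, one for its gender and one for its language, which is the same shape as the dictionaries in the 'encoded_features' column.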

Local Outlier Factor (LOF) models are distance-based learning algorithms. Therefore, we need to standardize the 'age' feature so that it is on roughly the same scale as the encoded categorical variables.


In [16]:
customer_data1['age'] = (customer_data['age'] - customer_data['age'].mean())/\
customer_data['age'].std()
customer_data1


Out[16]:
id age encoded_features
820tgsjxq7 0.0420907775098 {2: 1, 28: 1}
4ft3gnwmtx 1.33196384402 {3: 1, 28: 1}
bjjt8pjhuk 0.328729236733 {3: 1, 28: 1}
87mebub9p4 0.257069621927 {1: 1, 28: 1}
lsw9q7uk0j 0.615367695957 {3: 1, 28: 1}
0d01nltbrs 0.687027310763 {3: 1, 28: 1}
a1vcnhxeij 0.90200615518 {3: 1, 28: 1}
6uh8zyj2gn 0.615367695957 {1: 1, 28: 1}
yuuqmid2rp -0.101228452102 {3: 1, 28: 1}
om1ss59ys8 0.687027310763 {3: 1, 28: 1}
[124681 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
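
The standardization applied to 'age' is an ordinary z-score. The sketch below shows the computation in plain Python on a few toy values; the function name standardize is hypothetical, and the use of the population standard deviation is an assumption of this sketch. The last lines redo the arithmetic for an age of 38 with the mean and standard deviation reported in the summary statistics above.

```python
import math

def standardize(values):
    """Return z-scores of `values`, using the population standard deviation."""
    n = len(values)
    mean = sum(values) / float(n)
    variance = sum((v - mean) ** 2 for v in values) / float(n)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]

# Toy ages (hypothetical).
ages = [28.0, 34.0, 43.0, 56.0]
z = standardize(ages)
print(z)

# Worked check with the reported summary statistics (mean 37.412629, std 13.954917):
# an age of 38 maps to roughly 0.042, in line with the first row of the table above.
z_38 = (38.0 - 37.412629) / 13.954917
print(round(z_38, 4))
```

After standardization the values have mean 0 and unit variance, so a difference of one standard deviation in age counts as a distance of 1.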

Training a Local Outlier Factor (LOF) Model

Next, we train the LOF model by using this transformed customer_data1 set.


In [17]:
model_lof = gl.anomaly_detection.local_outlier_factor.create(customer_data1, 
                                                             features = ['age', 'encoded_features'],
                                                             threshold_distances=True,
                                                             verbose=False)

In [18]:
model_lof.save('./model_lof')

model_lof = gl.load_model('./model_lof/')


In [19]:
print 'The LOF model has been trained with the following options:'
print '-------------------------------------------------------------'
print model_lof.get_current_options()


The LOF model has been trained with the following options:
-------------------------------------------------------------
{'distance': [[['encoded_features'], 'jaccard', 1.0], [['age'], 'euclidean', 1.0]], 'verbose': False, 'num_neighbors': 5, 'threshold_distances': True}

Note that the model automatically chooses a suitable metric for the data type of each available feature. Here, a composite distance has been chosen that combines a 'jaccard' metric for the 'encoded_features' column with a 'euclidean' metric for the 'age' column. Both metrics are weighted by 1.0.
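
As a rough illustration of such a composite distance, the sketch below adds a Jaccard distance on the keys of the sparse dictionaries to an absolute difference on the standardized age, each with weight 1.0. The function names are hypothetical, and treating the Jaccard component as a set distance over dictionary keys is an assumption of this sketch rather than a statement about GLC internals. The two rows are taken from the head of the transformed data shown earlier.

```python
def jaccard_distance(a, b):
    """Jaccard distance between the key sets of two sparse dictionaries."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    if not union:
        return 0.0
    return 1.0 - len(keys_a & keys_b) / float(len(union))

def euclidean_distance(x, y):
    """Euclidean distance between two scalars."""
    return abs(x - y)

def composite_distance(row_a, row_b, w_jaccard=1.0, w_euclidean=1.0):
    """Weighted sum of the two component distances."""
    d_cat = jaccard_distance(row_a['encoded_features'], row_b['encoded_features'])
    d_age = euclidean_distance(row_a['age'], row_b['age'])
    return w_jaccard * d_cat + w_euclidean * d_age

row_a = {'age': 0.0420907775098, 'encoded_features': {2: 1, 28: 1}}
row_b = {'age': 1.33196384402, 'encoded_features': {3: 1, 28: 1}}
d_categories = jaccard_distance(row_a['encoded_features'], row_b['encoded_features'])
d_total = composite_distance(row_a, row_b)
print(d_categories, d_total)
```

Two rows that share the language but differ in gender have one common key out of three, so the Jaccard part is 2/3. This appears consistent with the 0.666666666667 neighborhood radius that shows up in the score tables further down.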

If we want to see what the model has built internally, we can simply write:


In [20]:
print model_lof


Class                                   : LocalOutlierFactorModel

Schema
------
Number of examples                      : 124681
Number of feature columns               : 2
Number of neighbors                     : 5
Use thresholded distances               : True
Number of distance components           : 2
Row label name                          : row_id

Training summary
----------------
Total training time (seconds)           : 2467.8927

Accessible fields
-----------------
nearest_neighbors_model                 : Model used internally to compute nearest neighbors.
scores                                  : Local outlier factor for each row in the input dataset.

More importantly, here is the SFrame with the LOF anomaly scores:


In [21]:
model_lof['scores']


Out[21]:
row_id density anomaly_score neighborhood_radius
0 inf nan 0.0
1 inf nan 0.0
2 inf nan 0.0
3 inf nan 0.0
4 inf nan 0.0
5 inf nan 0.0
6 inf nan 0.0
7 inf nan 0.0
8 inf nan 0.0
9 inf nan 0.0
[124681 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

First, note that the model worked successfully, scoring each of the 124,681 input rows. Second, the anomaly score for many observations in our AirBnB dataset is nan, which indicates that the point has many neighbors at exactly the same location, making the ratio of densities undefined. These points cannot be outliers.

However, for the problem at hand we are interested in finding whether any outliers exist and under what circumstances they occur. This is where the real business value lies!
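
The nan scores described above can be reproduced with a small brute-force implementation of the textbook LOF definition (k-distance, reachability distance, local reachability density). This is a sketch for intuition only: the function lof_scores and the toy ages are hypothetical, and the GLC implementation may differ in its details.

```python
import math

def lof_scores(points, k, dist):
    """Brute-force textbook LOF scores for a small list of points."""
    n = len(points)
    neighbors, k_distance = [], []
    for i in range(n):
        # Rank all other points by distance to point i (ties broken by index).
        ranked = sorted((dist(points[i], points[j]), j) for j in range(n) if j != i)
        neighbors.append([j for _, j in ranked[:k]])
        k_distance.append(ranked[k - 1][0])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist(points[i], points[j])) for j in neighbors[i]]
        mean_reach = sum(reach) / float(k)
        lrd.append(float('inf') if mean_reach == 0 else 1.0 / mean_reach)

    # LOF: mean ratio of the neighbors' densities to the point's own density.
    scores = []
    for i in range(n):
        ratios = [lrd[j] / lrd[i] for j in neighbors[i]]
        scores.append(sum(ratios) / float(k))
    return scores

# Four identical ages, one age right next to them, and a spread-out group.
ages = [30, 30, 30, 30, 31, 50, 52, 54, 56, 58]
scores = lof_scores(ages, k=3, dist=lambda a, b: abs(a - b))
for age, score in zip(ages, scores):
    print(age, score)
```

The four identical points have infinite density, so their score is inf divided by inf, that is nan. The point at 31 has a finite density but infinitely dense neighbors, so its score is infinite. The spread-out group receives ordinary finite scores.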

Using the LOF Model to detect anomalies

There are two common ways to determine which observations in your data set are anomalous:

A. Ask the trained model to return the k most anomalous observations:

By applying the .topk() method of the model scores SFrame

In [22]:
top10_anomalies = model_lof['scores'].topk('anomaly_score', k=10)
top10_anomalies.print_rows(num_rows=10)


+--------+---------------+---------------+---------------------+
| row_id |    density    | anomaly_score | neighborhood_radius |
+--------+---------------+---------------+---------------------+
|  787   | 13.9548615034 |      inf      |   0.0716596148059   |
|  3678  | 13.9548615034 |      inf      |   0.0716596148059   |
|  5328  | 1.63764535897 |      inf      |    0.666666666667   |
|  6528  | 2.09512063742 |      inf      |    0.666666666667   |
|  8788  | 13.9548615034 |      inf      |   0.0716596148059   |
|  9626  | 13.9548615034 |      inf      |   0.0716596148059   |
| 10083  |      1.5      |      inf      |    0.666666666667   |
| 10727  | 13.9548615034 |      inf      |   0.0716596148059   |
| 10765  | 13.9548615034 |      inf      |   0.0716596148059   |
| 11038  |      1.5      |      inf      |    0.666666666667   |
+--------+---------------+---------------+---------------------+
[10 rows x 4 columns]

Note that the anomaly scores for these points are infinite, which happens when a point is next to several identical points, but is not itself a member of that bunch. These points are certainly anomalous, but our specific choice of k was arbitrary and excluded many points that are also likely anomalous.
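
The arbitrariness of k when many scores are tied at infinity can be seen in a small pure-Python sketch. The helper top_k_anomalies is hypothetical and is not the SFrame .topk() method; it simply drops nan scores and keeps the k largest of the rest.

```python
import heapq
import math

def top_k_anomalies(scores, k):
    """Return (row_id, score) pairs for the k largest scores, ignoring nan."""
    valid = [(i, s) for i, s in enumerate(scores) if not math.isnan(s)]
    return heapq.nlargest(k, valid, key=lambda pair: pair[1])

# Toy scores (hypothetical): two infinite scores, one nan, the rest finite.
scores = [1.0, float('nan'), float('inf'), 1.2, 0.9, float('inf')]
top3 = top_k_anomalies(scores, 3)
top1 = top_k_anomalies(scores, 1)
print(top3)
print(top1)
```

With k=3 both infinite scores are returned along with the largest finite one, whereas k=1 returns only one of the two tied rows, even though the other is just as anomalous.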

B. Choose a threshold, based on domain knowledge or scientific expertise, to find the anomalous observations in your data set:

observations with an 'anomaly_score' greater than or equal to this 'threshold' will be the anomalous ones.

Of course, a closer look at the distribution of the anomaly_scores may help us a lot with this decision.


In [23]:
anomaly_scores_sketch = model_lof['scores']['anomaly_score'].sketch_summary()
print anomaly_scores_sketch


+--------------------+--------+----------+
|        item        | value  | is exact |
+--------------------+--------+----------+
|       Length       | 124681 |   Yes    |
|        Min         | 0.865  |   Yes    |
|        Max         |  inf   |   Yes    |
|        Mean        |  nan   |   Yes    |
|        Sum         |  inf   |   Yes    |
|      Variance      |  nan   |   Yes    |
| Standard Deviation |  nan   |   Yes    |
|  # Missing Values  |   0    |   Yes    |
|  # unique values   |  580   |    No    |
+--------------------+--------+----------+

Most frequent items:
+-------+-----+-----+----------------+------+----------------+----------------+
| value | 1.0 | inf | 0.966666666667 | 1.08 | 0.933333333333 | 0.942857142857 |
+-------+-----+-----+----------------+------+----------------+----------------+
| count | 643 | 193 |       29       |  26  |       19       |       18       |
+-------+-----+-----+----------------+------+----------------+----------------+
+----------------+------+------+------+
| 0.885714285714 | 1.16 | 1.07 | 1.11 |
+----------------+------+------+------+
|       15       |  14  |  12  |  11  |
+----------------+------+------+------+

Quantiles: 
+-------+----------------+----------------+-----+-----+---------------+-----+
|   0%  |       1%       |       5%       | 25% | 50% |      75%      | 95% |
+-------+----------------+----------------+-----+-----+---------------+-----+
| 0.865 | 0.885714285714 | 0.936666666667 | 1.0 | 1.0 | 1.21333333333 | inf |
+-------+----------------+----------------+-----+-----+---------------+-----+
+-----+------+
| 99% | 100% |
+-----+------+
| inf | inf  |
+-----+------+


In [24]:
threshold = anomaly_scores_sketch.quantile(0.9)
anomalies_mask = model_lof['scores']['anomaly_score'] >= threshold
anomalies = model_lof['scores'][anomalies_mask]
print 'Threshold: %.5f' % threshold, '\nNumber of Anomalies: %d' % len(anomalies)


Threshold: inf 
Number of Anomalies: 193
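
Because the chosen quantile is itself infinite, only the rows with an infinite score pass the test. The pure-Python sketch below, with a hypothetical helper flag_anomalies and toy scores, shows how the same comparison behaves with an infinite and with a finite threshold, and that nan scores never qualify because any comparison with nan is False.

```python
def flag_anomalies(scores, threshold):
    """Return the row ids whose score is greater than or equal to the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]

# Toy scores (hypothetical): one nan, one infinite, one large finite score.
scores = [1.0, float('nan'), float('inf'), 1.2, 3.5, 0.9]

only_infinite = flag_anomalies(scores, float('inf'))
with_finite_cutoff = flag_anomalies(scores, 2.0)
print(only_infinite)       # [2]
print(with_finite_cutoff)  # [2, 4]
```

A finite threshold chosen from domain knowledge would therefore also flag rows with large but finite scores, which an infinite threshold leaves out.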

In [25]:
anomalies.print_rows(num_rows=10)


+--------+---------------+---------------+---------------------+
| row_id |    density    | anomaly_score | neighborhood_radius |
+--------+---------------+---------------+---------------------+
|  787   | 13.9548615034 |      inf      |   0.0716596148059   |
|  3678  | 13.9548615034 |      inf      |   0.0716596148059   |
|  5328  | 1.63764535897 |      inf      |    0.666666666667   |
|  6528  | 2.09512063742 |      inf      |    0.666666666667   |
|  8788  | 13.9548615034 |      inf      |   0.0716596148059   |
|  9626  | 13.9548615034 |      inf      |   0.0716596148059   |
| 10083  |      1.5      |      inf      |    0.666666666667   |
| 10727  | 13.9548615034 |      inf      |   0.0716596148059   |
| 10765  | 13.9548615034 |      inf      |   0.0716596148059   |
| 11038  |      1.5      |      inf      |    0.666666666667   |
+--------+---------------+---------------+---------------------+
[193 rows x 4 columns]

Finally, we can filter the customer_data set by anomalies['row_id'] to recover the original features of these anomalous records.


In [26]:
customer_data = customer_data.add_row_number(column_name='row_id')
anomalous_customer_data = customer_data.filter_by(anomalies['row_id'], 'row_id')
anomalous_customer_data.print_rows(num_rows=200)


+--------+------------+-----------+-------+----------+
| row_id |     id     |   gender  |  age  | language |
+--------+------------+-----------+-------+----------+
|  787   | w6i3ix717s |   OTHER   |  36.0 |    en    |
|  3678  | jwzspk0ipl |    MALE   |  39.0 |    zh    |
|  5328  | eqsihtnz34 |   FEMALE  |  36.0 |    hu    |
|  6528  | dyu0sssqo5 | -unknown- |  47.0 |    nl    |
|  8788  | 91vfcvol82 |    MALE   |  91.0 |    en    |
|  9626  | t6fvmrna0t |    MALE   |  98.0 |    en    |
| 10083  | n45ipduv9i |    MALE   |  28.0 |    fi    |
| 10727  | 9zhr7vpciy |    MALE   |  39.0 |    fr    |
| 10765  | lerui8bp4h |   FEMALE  |  88.0 |    en    |
| 11038  | h0cf46ubyt |    MALE   |  27.0 |    fi    |
| 12293  | unnvgq3efo |    MALE   |  40.0 |    pl    |
| 13926  | 1yoqktv6n6 |   OTHER   |  36.0 |    en    |
| 13980  | 2a9z5icq6y |    MALE   |  39.0 |    de    |
| 14044  | oyr9d8w1ig |   OTHER   |  39.0 |    en    |
| 15897  | lqf1twcvos |    MALE   | 101.0 |    en    |
| 17154  | c817bnjsp4 |    MALE   |  34.0 |    th    |
| 18260  | u0c8pp8dow |   FEMALE  |  19.0 |    de    |
| 18774  | t5g7yx6sks |   FEMALE  |  27.0 |    sv    |
| 19324  | y5l9io8veg |    MALE   |  91.0 |    en    |
| 20849  | n8f00fxpav |   OTHER   |  39.0 |    en    |
| 21232  | 0xrkw4fyw2 |   FEMALE  |  33.0 |    ru    |
| 21279  | 2scyrludwh |    MALE   |  98.0 |    en    |
| 21557  | kfdibfstle |   FEMALE  |  27.0 |    sv    |
| 22354  | 4ha5obt82l |    MALE   |  35.0 |    it    |
| 22816  | pscm1xlz37 |    MALE   |  91.0 |    en    |
| 24235  | 2rvp3se9j9 |   FEMALE  |  34.0 |    de    |
| 25081  | p7puqntomm |    MALE   |  40.0 |    ko    |
| 25101  | 2g6hrnhnb6 |    MALE   |  94.0 |    en    |
| 26681  | 6wf96zdcvk |    MALE   |  98.0 |    en    |
| 26786  | p21uet4l05 |   FEMALE  |  26.0 |    ru    |
| 26962  | bj5g8lixyq |   FEMALE  |  88.0 |    en    |
| 27673  | rzsz6n5i05 |   FEMALE  |  34.0 |    de    |
| 28426  | uipdrp7drt |   OTHER   |  44.0 |    en    |
| 29007  | nnqjh2u2re |    MALE   |  39.0 |    de    |
| 30375  | ko40ov8fsf |   FEMALE  |  35.0 |    el    |
| 30737  | q5ljs1sqyq |    MALE   |  39.0 |    de    |
| 30740  | 5msz5ddlxi |    MALE   |  35.0 |    it    |
| 30777  | ho9ag8jmbi |    MALE   |  94.0 |    en    |
| 31305  | 7ipdgkwscn |    MALE   |  23.0 |    de    |
| 31616  | qf21bmidle |   FEMALE  |  27.0 |    de    |
| 31986  | 8vn7n6732o |    MALE   |  36.0 |    es    |
| 32879  | 30mrn0pd33 |    MALE   |  25.0 |    ru    |
| 33724  | mh7ykpn147 |    MALE   | 101.0 |    en    |
| 34950  | bfzxxhwhni |   FEMALE  |  36.0 |    hu    |
| 36948  | iknls6q14s |    MALE   |  39.0 |    fr    |
| 38244  | dmccxwu3sl |   FEMALE  |  64.0 |    de    |
| 40214  | 9uu7cyhq1v |    MALE   |  25.0 |    ru    |
| 41241  | bpmp74acdf |   FEMALE  |  32.0 |    th    |
| 42377  | 4s5y11lmre |    MALE   |  40.0 |    ko    |
| 42546  | rxfwhr3158 |   OTHER   |  44.0 |    en    |
| 43225  | irj9i2n0nx |   FEMALE  |  22.0 |    sv    |
| 44091  | mx2eqy1cpk |    MALE   | 108.0 |    en    |
| 45005  | mvqqbw343t |   OTHER   |  36.0 |    en    |
| 45033  | iudbydxi6o |    MALE   |  39.0 |    zh    |
| 45047  | j24sw5v572 |   FEMALE  |  88.0 |    en    |
| 46339  | s61x07vvsg |    MALE   |  94.0 |    en    |
| 46375  | roy5fk3ez5 |    MALE   |  27.0 |    ru    |
| 46502  | pv1wki3itf | -unknown- |  15.0 |    en    |
| 47261  | 98w8a1t65x |   FEMALE  |  33.0 |    ru    |
| 47336  | k1g3iax7gd |    MALE   |  92.0 |    en    |
| 47738  | l2hbp7tg6g |    MALE   |  98.0 |    en    |
| 47838  | y8jm031su3 |    MALE   | 101.0 |    en    |
| 47886  | p69kdgrl2g |    MALE   |  39.0 |    de    |
| 47904  | a87wnoi5u7 |    MALE   |  39.0 |    zh    |
| 48480  | 874fy2hc0v |   FEMALE  |  40.0 |    fr    |
| 48796  | wc93nt5vok |    MALE   |  35.0 |    it    |
| 49777  | wc8bb71jnw |    MALE   |  25.0 |    ru    |
| 50889  | 7zrwfyh8yy |    MALE   |  35.0 |    it    |
| 51023  | fqeav2qj5i |   FEMALE  |  22.0 |    hu    |
| 51588  | f2hj5gw4c0 |   FEMALE  |  22.0 |    de    |
| 52592  | fq2x6nalo9 |   FEMALE  |  88.0 |    en    |
| 53554  | 0n07xj0qlx |    MALE   |  92.0 |    en    |
| 54821  | yiqo26yodm |   FEMALE  |  40.0 |    fr    |
| 55830  | n6ugg334eg |    MALE   |  71.0 |    fr    |
| 57747  | h6k3524pqn | -unknown- |  26.0 |    zh    |
| 58518  | mph6ldpg5p |   FEMALE  |  26.0 |    ru    |
| 59018  | xgg4udkocy |    MALE   | 108.0 |    en    |
| 59040  | axg2e2tgf4 |   FEMALE  |  19.0 |    de    |
| 59399  | z6qfisrre9 |   OTHER   |  39.0 |    en    |
| 59484  | unciu6n9s1 |   OTHER   |  51.0 |    en    |
| 60707  | mxgy2hi8lc |    MALE   |  26.0 |    cs    |
| 61707  | 9rw3sypabc |    MALE   |  23.0 |    de    |
| 63081  | u00c0qlf1o |   FEMALE  |  22.0 |    de    |
| 63436  | az9pobqi2g |   FEMALE  |  34.0 |    de    |
| 63895  | kr0l5h8j4i |    MALE   |  98.0 |    en    |
| 64820  | dtrg98vt27 |   FEMALE  |  40.0 |    fr    |
| 65385  | zm5p80tzgu |    MALE   |  39.0 |    fr    |
| 66865  | x66sn9ndsy |    MALE   |  27.0 |    th    |
| 67077  | 14z4t55a8l |    MALE   |  36.0 |    es    |
| 67979  | qzepyxvw0d |   FEMALE  |  34.0 |    de    |
| 69324  | uixx9403eo |   FEMALE  |  27.0 |    sv    |
| 69404  | uotgs2tnr1 |    MALE   | 108.0 |    en    |
| 71644  | dsr06wqj6v |    MALE   | 108.0 |    en    |
| 71976  | 877hi481jr |   FEMALE  |  26.0 |    ru    |
| 72079  | dj9v652mbc |   FEMALE  |  25.0 |    el    |
| 72318  | 6i55c93kup |    MALE   |  40.0 |    ko    |
| 72700  | sa94ou8bok | -unknown- |  34.0 |    fr    |
| 72912  | co08lr3pn3 |   FEMALE  |  37.0 |    ko    |
| 73235  | 1s3cid1010 |    MALE   |  28.0 |    th    |
| 73627  | m4dfh5jm3v | -unknown- |  15.0 |    en    |
| 73968  | zkwakc1i08 |    MALE   |  42.0 |    ko    |
| 75053  | 29ql6k9he8 |   OTHER   |  51.0 |    en    |
| 75421  | ry4uemic5o |    MALE   |  39.0 |    fr    |
| 76863  | 1zds91p8m9 |    MALE   |  46.0 |    da    |
| 77000  | vio6n04q46 |   FEMALE  |  45.0 |    da    |
| 77838  | fwrysylzt1 |    MALE   |  92.0 |    en    |
| 78156  | bt91ucqv6m |    MALE   |  91.0 |    en    |
| 79167  | xy8qb61c1r |   FEMALE  |  26.0 |    ru    |
| 79375  | puvgvvu6xs |    MALE   |  35.0 |    it    |
| 79592  | 0y57yi1sc1 |   FEMALE  |  41.0 |    es    |
| 80050  | 2flgaa2ub3 |   FEMALE  |  37.0 |    el    |
| 81516  | 4ewysjr8sp |   FEMALE  |  26.0 |    ru    |
| 81853  | j4lbte90v8 |   OTHER   |  51.0 |    en    |
| 82867  | erlamwz51w |    MALE   |  23.0 |    de    |
| 83529  | nxx0y70s3h |    MALE   |  27.0 |    ru    |
| 85076  | jqry7hv8us |   FEMALE  |  30.0 |    fi    |
| 86261  | fxk0o2piqx |    MALE   |  42.0 |    ko    |
| 86618  | 8dwjr6bhnq |   FEMALE  |  22.0 |    de    |
| 87586  | lql3y0u1u3 |   FEMALE  |  41.0 |    es    |
| 87860  | xft8nyld5s | -unknown- |  20.0 |    it    |
| 88014  | 00fn6wu77e |   FEMALE  |  27.0 |    de    |
| 88022  | r13gwqsa66 |   OTHER   |  44.0 |    en    |
| 88066  | mkd8qo897u |   OTHER   |  51.0 |    en    |
| 88603  | w68eu2asd4 |    MALE   |  36.0 |    es    |
| 88918  | l1s2ftl7hf |    MALE   |  27.0 |    ru    |
| 89327  | bhvr1c7q4k |   FEMALE  |  27.0 |    de    |
| 91202  | wox91filok |   FEMALE  |  27.0 |    sv    |
| 91279  | y00cramb1g |    MALE   |  44.0 |    el    |
| 91418  | 18g7gss5n9 | -unknown- |  31.0 |    ru    |
| 91446  | y3nndt4alt |    MALE   |  27.0 |    ru    |
| 92081  | rjynx8s8g8 |   OTHER   |  44.0 |    en    |
| 92603  | 2szv3d907w |    MALE   |  23.0 |    de    |
| 94586  | zl1qmq4qa7 |    MALE   |  20.0 |    id    |
| 94819  | 6rzjma0vmq |    MALE   |  35.0 |    cs    |
| 95037  | 0vhiuvupgj |    MALE   |  17.0 |    it    |
| 95115  | 1jmue0ct0g |   FEMALE  |  22.0 |    de    |
| 96202  | dolc5wcp15 |   FEMALE  |  35.0 |    th    |
| 96226  | n1d9o1nz58 |   FEMALE  |  31.0 |    th    |
| 96494  | uygi2kex5w |    MALE   | 101.0 |    en    |
| 96646  | 2mkf8vu4v4 |    MALE   |  92.0 |    en    |
| 96760  | wd8tpxs87a |    MALE   |  28.0 |    th    |
| 96777  | i9va17b12r | -unknown- |  32.0 |    ru    |
| 97069  | xhunwa3b0l |   FEMALE  |  27.0 |    de    |
| 97173  | wy1bxe3bnb |   FEMALE  |  41.0 |    es    |
| 97419  | nozrl9d436 |    MALE   |  94.0 |    en    |
| 97580  | vy7bf52b63 |    MALE   |  23.0 |    th    |
| 97621  | wyvbii39o1 |    MALE   | 101.0 |    en    |
| 98094  | a7q5y0tfht |   FEMALE  |  41.0 |    es    |
| 99610  | yxeehrr8jv |    MALE   |  36.0 |    es    |
| 100076 | 7gxvw7sflb |   FEMALE  |  62.0 |    ru    |
| 100657 | xxkdrpaffu |    MALE   |  25.0 |    ru    |
| 102474 | yb3c5u8ggx |   FEMALE  |  43.0 |    zh    |
| 103187 | t6bck4li4j |   OTHER   |  39.0 |    en    |
| 103301 | 67af8d7zcj |    MALE   |  92.0 |    en    |
| 103897 | jgx9bdcxih |    MALE   |  91.0 |    en    |
| 104831 | agy23vnsom | -unknown- |  26.0 |    zh    |
| 105084 | 832znjbicy |    MALE   | 108.0 |    en    |
| 105299 | jozq26g7ol |    MALE   |  39.0 |    fr    |
| 105507 | mch7jrjfsy |    MALE   | 104.0 |    it    |
| 106591 | f8v9hjple3 |    MALE   |  39.0 |    zh    |
| 106743 | h8q44jeteh |    MALE   |  71.0 |    ru    |
| 107322 | ozspjod2dy |    MALE   |  40.0 |    ko    |
| 108304 | x4k358fahh |   FEMALE  |  22.0 |    de    |
| 108321 | u5t34yo1gs |   FEMALE  |  88.0 |    en    |
| 108802 | kjv2snx9aj |   OTHER   |  36.0 |    en    |
| 109874 | t7xlg3jg7j |   FEMALE  |  43.0 |    zh    |
| 109906 | pt6f662hxb |    MALE   |  36.0 |    es    |
| 110070 | x447w4khpq |   FEMALE  |  19.0 |    de    |
| 110158 | y9snhnr4gp |    MALE   |  39.0 |    zh    |
| 111490 | 5s7stqy8cc |    MALE   |  43.0 |    tr    |
| 111763 | mty0558lp8 | -unknown- |  42.0 |    ko    |
| 112458 | 8izfetkj1u |    MALE   |  75.0 |    ru    |
| 112911 | cwp2jtnb1a |    MALE   |  27.0 |    ru    |
| 113166 | u6u4qll6wr |   OTHER   |  39.0 |    en    |
| 114022 | ew43x6pikv |   FEMALE  |  25.0 |    hu    |
| 115461 | sbz4jj4dw2 |   FEMALE  |  41.0 |    es    |
| 115566 | gv5bn08cd8 |   FEMALE  |  19.0 |    de    |
| 115668 | nqc3hwerk2 |    MALE   |  94.0 |    en    |
| 117742 | jkywcvqxp0 |   FEMALE  |  34.0 |    de    |
| 117781 | pej0es6pdc |   FEMALE  |  33.0 |    ru    |
| 117899 | xc4k98fhy8 |   FEMALE  |  43.0 |    zh    |
| 118267 | wyczz5vwhe |    MALE   |  25.0 |    ru    |
| 118294 | f5gg5hp4ne |   FEMALE  |  35.0 |    el    |
| 118694 | 1lqc712a7q |    MALE   |  42.0 |    ko    |
| 119047 | rlbey80etn |   OTHER   |  44.0 |    en    |
| 119662 | p6mq3z0m1i |   FEMALE  |  27.0 |    de    |
| 120236 | kzzt7oqzun |   FEMALE  |  43.0 |    zh    |
| 120240 | fkpwhf0xp6 |   FEMALE  |  35.0 |    th    |
| 120801 | yilb3fj7k4 |   FEMALE  |  19.0 |    de    |
| 121962 | 0sdd1z92v2 |    MALE   |  40.0 |    ko    |
| 122634 | kb1ja3xxln |    MALE   |  42.0 |    ko    |
| 123011 | 3737oc2uis |    MALE   |  42.0 |    ko    |
| 124267 | xxhpg5929w |   OTHER   |  36.0 |    en    |
+--------+------------+-----------+-------+----------+
[193 rows x 5 columns]
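
The lookup performed in the last cell amounts to numbering the rows by position and keeping those whose number appears among the anomalous row ids. A minimal pure-Python sketch follows; the helper rows_for_ids and the toy records are hypothetical. It assumes, as the notebook does, that the model's row_id is the position of the row in the data passed to training, so the numbering must be applied to data in that same order.

```python
def rows_for_ids(rows, wanted_ids):
    """Attach a positional row_id to each row and keep only the wanted ids."""
    wanted = set(wanted_ids)
    return [dict(row, row_id=i) for i, row in enumerate(rows) if i in wanted]

# Toy records (hypothetical ids), in the same order as the training data.
customers = [
    {'id': 'aaa', 'gender': 'FEMALE', 'age': 34.0, 'language': 'en'},
    {'id': 'bbb', 'gender': 'MALE', 'age': 91.0, 'language': 'en'},
    {'id': 'ccc', 'gender': 'MALE', 'age': 28.0, 'language': 'fi'},
]
anomalous_row_ids = [1, 2]
anomalous_customers = rows_for_ids(customers, anomalous_row_ids)
for record in anomalous_customers:
    print(record)
```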

